
Unraid 6.8.3 / 6.9.0


Dear all,

I need help troubleshooting my server, which is haunted by what is presumably a variant of the infamous Ryzen system freeze. So far none of the advice I found here in the forum or on the Internet has resolved the issue, and it seems to appear only under specific circumstances (the timing of the freeze still seems random, though). Unfortunately, the syslog shows nothing relevant (i.e. no error, warning, or kernel panic). I attached several syslogs that end at the time of the crash (not all have the hardware config below, since I also tested whether it was a config issue and whether a fresh system with a different array and only minimal plugins would solve it) as well as the diagnostics (taken before the crash, because afterwards the server is completely frozen and the web interface and ssh are no longer usable).

Any help with this is highly appreciated!

How the issue manifests and when it appears

When the issue appears, the system freezes completely. The web interface and ssh are not accessible and all VMs are frozen as well. Even the segment display on the mainboard that indicates the CPU temperature seems to freeze.

Running just the server does not seem to cause any freezes, but running multiple VMs is the surest way I have found to get the server into a state where it freezes. I can run a Windows 10 VM with either of the two GPUs (see below) passed through without any problems. The Windows VM is configured (with respect to energy settings etc.) following the video by spaceinvader one.

However, even attempting to install Mac OS with just VNC graphics will (most of the time) freeze the server during the install, without the dedicated GPU being passed through to the Mac OS VM. After a successful install, starting the Mac OS VM in addition to the Windows VM will cause a freeze after a seemingly random amount of time. The freeze seems to occur sooner when the Mac OS VM is in the "locked screen" state (I am not sure what energy state this entails). A similar situation arises when installing another Windows VM. The install will go through most of the time when the secondary GPU is not assigned to the VM. However, when the RX 580 is assigned to the second VM during install, it crashes quite reliably when the "copying files" step reaches 100%. If the GPU is passed through after the install has completed, the system will freeze after a seemingly random interval, just like it does with the Mac OS VM. At the same time, any running Ubuntu server VMs (e.g. serving WebDAV) do not affect the freezes in any way.

However, running multiple VMs seems to be only one way to get the system into a freezing state. The first time this happened for me pretty consistently was by running a parity check after upgrading my array from two 2TB drives to two 5TB drives (in both cases one formatted as btrfs and one as parity). This caused a freeze most of the time, regardless of whether any VM was running or not.

Hardware of the server

Here is a description of the hardware in that server. I have two GPUs in this server in order to be able to run a Windows VM at all times with the RTX 2080 and have the option to run either a Mac OS VM or a desktop Linux VM with the RX 580 at the same time. The server boots in legacy mode. No dockers are actively running. The currently running version of Unraid is 6.9.0-RC2, but the issue was also present in 6.8.3 before upgrading.

- CPU: Threadripper 2920X
- MB: X399M Taichi (BIOS 3.80, latest as of posting this)
- RAM: 64 GB ECC DDR4 (rated for 3200 MHz, but clocked at 2666 MHz for troubleshooting, since this is the BIOS default and below the max indicated by this FAQ post: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173)
- GPU 1: Asus RTX 2080 Dual OC (in PCIe slot 1)
- GPU 2: Sapphire RX 580 Nitro+ (in PCIe slot 3)
- PSU: Corsair SF750
- Cache Pool 1: 2x Samsung 970 Evo M.2 1TB
- Cache Pool 2: 1x Crucial 2TB MX500
- Array: 1x 3 TB Hitachi drive (formatted as xfs; currently only this disk is present for troubleshooting, as explained below)

Installed Plugins

- CA Backup / Restore Appdata
- CA Dynamix Unlimited Width
- Community Applications
- Custom Tab
- Dynamix Cache Directories
- Dynamix SSD Trim
- Dynamix System Temperature
- Fix Common Problems
- Nerd Tools
- Unassigned Devices
- Unassigned Devices Plus
- unBALANCE
- User Scripts

BIOS/UEFI settings not at BIOS defaults currently

- IOMMU is set to "Enabled": this is needed to run VMs.
- Idle Current is set to "Typical Current Idle": this was the first setting I applied for this new build, since it is suggested in the FAQ in the forum here (https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173).

Things I already tried to remedy the issue (not strictly in the order I tried them, but approximately)

1. Disabling "Global C-State Control" in BIOS
   This did not have any noticeable effect. It was also not mentioned in the FAQ post here in the forum (https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173). Therefore, I set it to "Auto" again.

2. Disabling "Precision Boost Overdrive" in BIOS
   This was suggested in some post on reddit, but I cannot find it anymore. However, it also did not have any noticeable effect and was not mentioned in the FAQ post, so I set it to "Enabled" again.

3. Running Memtest to see if the RAM was faulty
   I thought maybe the rated 3200 MHz of the RAM sticks was too much and ran memtests. The first one was at 3200 MHz with the two ECC sticks. Despite seeing no errors overnight, I still decreased the clock to 2933 MHz (as suggested in the FAQ post) and ran another test, again without any errors. I then temporarily switched to two 8 GB non-ECC sticks to perform another test, again with no errors showing. Ultimately, I further decreased the RAM clock to 2666 MHz, which is the BIOS default, just to be sure.

4. Replacing the SMR array disks with a CMR disk
   This was prompted by the parity check failing after the disk replacement. However, it did not affect the freezes.

5. Adding the kernel boot parameters "processor.max_cstate=5", "rcu_nocbs=0-23", and "idle=nomwait"
   This was suggested in the following post on the Arch Linux forums: https://bbs.archlinux.org/viewtopic.php?id=245608. The settings considerably delayed the occurrence of freezes, from minutes or a few hours to several hours (e.g. overnight). Since they seemed to have at least some effect, I left them in the syslinux conf. My understanding is that "max_cstate=5" is equivalent to running the zenstates utility in order to disable C6, so I did not try that as well. Or does it do something different or additional?

6. Enabling SR-IOV and ACS in BIOS
   This was a long shot. The idea was that, even if the GPUs do not support these features, it might have an effect, since passing through two GPUs seems to accelerate the issue. However, it did not, so I reverted it to the BIOS defaults.

7. Enabling "Deep Sleep" in BIOS
   This was suggested as a fix for idle/instability issues in this reddit post: https://www.reddit.com/r/Amd/comments/8yzvxz/ryzen_c6_state_sleep_power_supply_common_current/. It did not seem to affect the freezes, so it is now disabled again.
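
To make the kernel boot parameters mentioned above concrete, here is a minimal Python sketch of how the "rcu_nocbs" range follows from the number of logical CPUs (24 threads on the 2920X gives "0-23") and how the resulting append line for the syslinux conf might look. The helper names `rcu_nocbs_range` and `build_append_line` are made up for illustration, and the base argument "initrd=/bzroot" is an assumption about a stock Unraid syslinux entry, so check it against your own file.

```python
def rcu_nocbs_range(logical_cpus: int) -> str:
    """Return an rcu_nocbs value covering all logical CPUs, e.g. '0-23' for 24."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    # CPU numbering starts at 0, so the last index is logical_cpus - 1.
    return f"0-{logical_cpus - 1}" if logical_cpus > 1 else "0"


def build_append_line(logical_cpus: int, base: str = "initrd=/bzroot") -> str:
    """Assemble the append line with the three parameters from the post.

    `base` is an assumed default; it is whatever the append line already
    contained before the extra parameters were added.
    """
    params = [
        "processor.max_cstate=5",
        f"rcu_nocbs={rcu_nocbs_range(logical_cpus)}",
        "idle=nomwait",
    ]
    return "append " + " ".join(params + [base])


if __name__ == "__main__":
    # Threadripper 2920X: 12 cores / 24 threads.
    print(build_append_line(24))
```

After a reboot, comparing this line with the contents of /proc/cmdline is a quick way to confirm that the parameters were actually picked up by the kernel.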

I am pretty much at a loss as to what else I can try. The fact that the syslog doesn't show anything at all also has me quite puzzled; I have never had that before with any Linux issue. Any pointers, comments, hints, suggestions, etc. are highly appreciated!
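
For anyone who wants to double-check the attached syslogs, a minimal Python sketch of the kind of scan meant by "nothing relevant" could look like this. It only searches the last lines of each file for a handful of keywords. The keyword list, the 200-line window, and the function name `find_suspicious_lines` are assumptions chosen for illustration, not an exhaustive or authoritative check.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Assumed keyword list; extend it as needed.
PATTERN = re.compile(r"\b(error|warning|panic|oops|mce|call trace)\b", re.IGNORECASE)


def find_suspicious_lines(lines: Iterable[str], tail: int = 200) -> List[Tuple[int, str]]:
    """Return (line_number, text) for keyword hits within the last `tail` lines."""
    all_lines = list(lines)
    start = max(0, len(all_lines) - tail)
    hits = []
    for idx in range(start, len(all_lines)):
        if PATTERN.search(all_lines[idx]):
            # Line numbers are 1-based to match what an editor shows.
            hits.append((idx + 1, all_lines[idx].rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python scan_syslog.py syslog_a syslog_b ...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            hits = find_suspicious_lines(fh)
        print(f"{path}: {len(hits)} suspicious line(s) near the end")
        for number, text in hits:
            print(f"  {number}: {text}")
```

An empty result only means that none of these keywords appear shortly before the log ends. A hard freeze can stop the system before anything is written to disk.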

Thanks a lot!

EDIT: I had a copy & paste error in point 5. Since I have a 2920X, the boot parameter was of course "rcu_nocbs=0-23".

cubezero-diagnostics-20201231-1841.zip syslog_20201225-02_fresh syslog_20201225-01_fresh syslog_20201224-01_fresh syslog_20200806-01_org

Edited January 1, 2021 by ledon

